To practice SQL querying, I connect to a Google Big Query Database and do some routine querying tasks. We have a database named Test with 3 tables: Inventory, Products, and Sales. Inventory contains ProductId, StoreId, StoreName, Address, neighborhood, QuantityAvailable. Products contains ProductId, ProductName, Supplier, ProductCost. Sales Contains SalesId, StoreId, ProductId, UnitPrice, Quantity.

Install Packages

install.packages('bigrquery', repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Steve/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'bigrquery' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'bigrquery'
## Warning in file.copy(savedcopy, lib, recursive = TRUE):
## problem copying C:\Users\Steve\AppData\Local\R\win-
## library\4.2\00LOCK\bigrquery\libs\x64\bigrquery.dll to C:
## \Users\Steve\AppData\Local\R\win-library\4.2\bigrquery\libs\x64\bigrquery.dll:
## Permission denied
## Warning: restored 'bigrquery'
## 
## The downloaded binary packages are in
##  C:\Users\Steve\AppData\Local\Temp\RtmpMzGzRg\downloaded_packages
install.packages('dplyr', repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Steve/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\Steve\AppData\Local\R\win-library\4.2\00LOCK\dplyr\libs\x64\dplyr.dll
## to C:\Users\Steve\AppData\Local\R\win-library\4.2\dplyr\libs\x64\dplyr.dll:
## Permission denied
## Warning: restored 'dplyr'
## 
## The downloaded binary packages are in
##  C:\Users\Steve\AppData\Local\Temp\RtmpMzGzRg\downloaded_packages
library(bigrquery)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Connect to Big Query

con <- dbConnect(
  bigrquery::bigquery(),
  project = "my-data-project-366100",
  dataset = "Test",
  billing = "my-data-project-366100"
)

Show Tables in Database

dbListTables(con)
## ! Using an auto-discovered, cached token.
##   To suppress this message, modify your code or options to clearly consent to
##   the use of a cached token.
##   See gargle's "Non-interactive auth" vignette for more details:
##   <]8;;https://gargle.r-lib.org/articles/non-interactive-auth.htmlhttps://gargle.r-lib.org/articles/non-interactive-auth.html]8;;>
## ℹ The bigrquery package is using a cached token for ']8;;mailto:wssitu@gmail.comwssitu@gmail.com]8;;'.
## [1] "Inventory" "Products"  "Sales"

We Create a SQL query that does a FULL OUTER join to create a single table

Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID
LIMIT 20
print(M1_results)
## # A tibble: 20 × 16
##    ProductId StoreId StoreName Address   neigh…¹ Quant…² Produ…³ Produ…⁴ Suppl…⁵
##        <int>   <int> <chr>     <chr>     <chr>     <int>   <int> <chr>   <chr>  
##  1       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  2       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  3       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  4       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  5       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  6       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  7       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  8       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  9       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 10       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 11       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 12       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 13       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 14       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 15       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 16       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 17       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 18       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 19       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 20       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## # … with 7 more variables: ProductCost <dbl>, SalesId <int>, StoreId_1 <int>,
## #   ProductId_2 <int>, Date <date>, UnitPrice <dbl>, Quantity <int>, and
## #   abbreviated variable names ¹​neighborhood, ²​QuantityAvailable, ³​ProductId_1,
## #   ⁴​ProductName, ⁵​Supplier

Using the single table, we query off of it by putting it in a where clause to create a subquery

Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID
LIMIT 20
print(M2_results)
## # A tibble: 20 × 16
##    ProductId StoreId StoreName Address   neigh…¹ Quant…² Produ…³ Produ…⁴ Suppl…⁵
##        <int>   <int> <chr>     <chr>     <chr>     <int>   <int> <chr>   <chr>  
##  1       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  2       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  3       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  4       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  5       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  6       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  7       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  8       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
##  9       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 10       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 11       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 12       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 13       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 14       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 15       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 16       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 17       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 18       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 19       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## 20       828   22623 Gordmans  0 Vermon… Hampden       4     828 Pepper… Gordma…
## # … with 7 more variables: ProductCost <dbl>, SalesId <int>, StoreId_1 <int>,
## #   ProductId_2 <int>, Date <date>, UnitPrice <dbl>, Quantity <int>, and
## #   abbreviated variable names ¹​neighborhood, ²​QuantityAvailable, ³​ProductId_1,
## #   ⁴​ProductName, ⁵​Supplier

We Check for nulls on StoreName column

Select *
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
WHERE StoreName IS NULL
LIMIT 20
print(M3_results)
## # A tibble: 0 × 16
## # … with 16 variables: ProductId <int64>, StoreId <int64>, StoreName <chr>,
## #   Address <chr>, neighborhood <chr>, QuantityAvailable <int64>,
## #   ProductId_1 <int64>, ProductName <chr>, Supplier <chr>, ProductCost <dbl>,
## #   SalesId <int64>, StoreId_1 <int64>, ProductId_2 <int64>, Date <date>,
## #   UnitPrice <dbl>, Quantity <int64>

We Count unique values of StoreName

Select COUNT(DISTINCT StoreName) AS distinct_count
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
LIMIT 20
print(M4_results)
## # A tibble: 1 × 1
##   distinct_count
##            <int>
## 1             34

We use some basic conditional statements like finding only rows where Storename = Walmart

Select *
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
WHERE StoreName = "Walmart"
LIMIT 20
print(M5_results)
## # A tibble: 20 × 16
##    ProductId StoreId StoreName Address   neigh…¹ Quant…² Produ…³ Produ…⁴ Suppl…⁵
##        <int>   <int> <chr>     <chr>     <chr>     <int>   <int> <chr>   <chr>  
##  1        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  2        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  3        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  4        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  5        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  6        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  7        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  8        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  9        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 10        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 11        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 12        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 13        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 14        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 15        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 16        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 17        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 18        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 19        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 20        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## # … with 7 more variables: ProductCost <dbl>, SalesId <int>, StoreId_1 <int>,
## #   ProductId_2 <int>, Date <date>, UnitPrice <dbl>, Quantity <int>, and
## #   abbreviated variable names ¹​neighborhood, ²​QuantityAvailable, ³​ProductId_1,
## #   ⁴​ProductName, ⁵​Supplier

Quantity Available >= 5

Select *
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
WHERE QuantityAvailable >= 5
LIMIT 20
print(M6_results)
## # A tibble: 20 × 16
##    ProductId StoreId StoreName Address   neigh…¹ Quant…² Produ…³ Produ…⁴ Suppl…⁵
##        <int>   <int> <chr>     <chr>     <chr>     <int>   <int> <chr>   <chr>  
##  1        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  2        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  3        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  4        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  5        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  6        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  7        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  8        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
##  9        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 10        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 11        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 12        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 13        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 14        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 15        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 16        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 17        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 18        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 19        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## 20        29   21777 Walmart   2 Laurel… Sabina…       5      29 Soup -… Walmart
## # … with 7 more variables: ProductCost <dbl>, SalesId <int>, StoreId_1 <int>,
## #   ProductId_2 <int>, Date <date>, UnitPrice <dbl>, Quantity <int>, and
## #   abbreviated variable names ¹​neighborhood, ²​QuantityAvailable, ³​ProductId_1,
## #   ⁴​ProductName, ⁵​Supplier

Creating a new column for total revenue as UnitPrice * Quantity

Select UnitPrice, Quantity, UnitPrice * Quantity AS total_revenue
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
LIMIT 20
M7_results
## # A tibble: 20 × 3
##    UnitPrice Quantity total_revenue
##        <dbl>    <int>         <dbl>
##  1     0.507        9          4.57
##  2     0.507       40         20.3 
##  3     0.507       90         45.7 
##  4     0.507       15          7.61
##  5     0.507       40         20.3 
##  6     0.507       85         43.1 
##  7     0.507       20         10.1 
##  8     0.507       37         18.8 
##  9     0.507       43         21.8 
## 10     0.507       47         23.9 
## 11     0.507       12          6.09
## 12     0.507       53         26.9 
## 13     0.507       28         14.2 
## 14     0.507       69         35.0 
## 15     0.507       60         30.4 
## 16     0.507       55         27.9 
## 17     0.507       13          6.60
## 18     0.507       58         29.4 
## 19     0.507       12          6.09
## 20     0.507       46         23.3

We use a group by statement with aggregate function average to find the average quantities sold of a product

grouped by StoreName and Product Name

Select StoreName, ProductName, AVG(Quantity) AS average_quantity
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
GROUP BY StoreName, ProductName
ORDER BY average_quantity DESC
LIMIT 20
M8_results
## # A tibble: 20 × 3
##    StoreName              ProductName                  average_quantity
##    <chr>                  <chr>                                   <dbl>
##  1 HomeGoods              Five Alive Citrus                        55.6
##  2 Family Dollar          Toamtoes 6x7 Select                      55.5
##  3 Renys                  Truffle Paste                            55.3
##  4 HomeSense              Kumquat                                  55.3
##  5 Family Dollar          Cakes Assorted                           55.2
##  6 Target                 Cheese - Cream Cheese                    55.2
##  7 Tuesday Morning        Bag Clear 10 Lb                          55.1
##  8 T.J. Maxx              Beef - Kobe Striploin                    55.0
##  9 Ollie's Bargain Outlet Beets - Candy Cane, Organic              54.9
## 10 Ollie's Bargain Outlet Veal - Tenderloin, Untrimmed             54.7
## 11 Gabe's                 Pepper - Black, Ground                   54.7
## 12 Gabe's                 Five Alive Citrus                        54.7
## 13 Dd's Discounts         Spice - Greek 1 Step                     54.7
## 14 Big Lots               Beer - Paulaner Hefeweisse               54.6
## 15 Meijer                 Sauce - White, Mix                       54.6
## 16 Ross Stores            Pop Shoppe Cream Soda                    54.5
## 17 Ocean State Job Lot    Lamb - Whole, Frozen                     54.5
## 18 Renys                  Sauce - Salsa                            54.5
## 19 Ross Stores            Cleaner - Comet                          54.4
## 20 Gabe's                 Sunflower Seed Raw                       54.4

We can further filter down to only StoreName = Walmart or Dollar General using a HAVING clause

WHERE clause can’t be used with aggragate functions

Select StoreName, ProductName, AVG(Quantity) AS average_quantity
FROM (Select *
FROM  Test.Inventory
FULL JOIN Test.Products 
ON Inventory.ProductID = Products.ProductID 
FULL JOIN Test.Sales
On Inventory.ProductID = Sales.ProductID)
GROUP BY StoreName, ProductName
HAVING StoreName = "Walmart" OR StoreName = "Dollar General"
ORDER BY average_quantity DESC
LIMIT 20
print(M9_results)
## # A tibble: 20 × 3
##    StoreName      ProductName                       average_quantity
##    <chr>          <chr>                                        <dbl>
##  1 Dollar General Country Roll                                  54.4
##  2 Dollar General Sunflower Seed Raw                            54.1
##  3 Walmart        Pepper - Sorrano                              53.9
##  4 Dollar General Lamb - Loin, Trimmed, Boneless                53.9
##  5 Dollar General Fenngreek Seed                                53.0
##  6 Walmart        Beans - Turtle, Black, Dry                    53.0
##  7 Dollar General Persimmons                                    52.9
##  8 Walmart        Chivas Regal - 12 Year Old                    52.9
##  9 Walmart        Appetizer - Mini Egg Roll, Shrimp             52.8
## 10 Dollar General Potatoes - Mini Red                           52.8
## 11 Dollar General Sole - Fillet                                 52.6
## 12 Walmart        Longos - Cheese Tortellini                    52.5
## 13 Walmart        Ginger - Ground                               52.1
## 14 Walmart        Spice - Paprika                               51.8
## 15 Dollar General Beef - Tongue, Fresh                          51.8
## 16 Walmart        Steamers White                                51.5
## 17 Dollar General Sherry - Dry                                  51.2
## 18 Walmart        Soup - Campbells, Creamy                      51.1
## 19 Walmart        Breakfast Quesadillas                         51.1
## 20 Dollar General Chef Hat 25cm                                 51.1